2024-01-23
Incorporating Survey Weights …
Primary Source: https://bookdown.org/rwnahhas/RMPH/survey-design.html
In many surveys, each sampled subject is assigned a weight that is equivalent to the reciprocal of his/her probability of selection into the sample.
\[ \mbox{Sample Subject's Weight} = \frac{1}{Prob(selection)} \]
but more sophisticated sampling designs require more complex weighting schemes. Usually these are published as part of the survey data.
I’ll demonstrate part of the survey package today.
Let’s use the NHANES 2013-14 data and pull in both the demographics (DEMO_H) and total cholesterol (TCHOL_H) databases.
Detailed descriptions available at
Weights for each sampled person in NHANES account for the complex survey design. The weight describes the number of people in the population represented by the sampled person, and is created in three steps:
Source: https://wwwn.cdc.gov/nchs/nhanes/tutorials/Module3.aspx
The DEMO file contains two kinds of sampling weights:
WTINT2yr), andWTMEC2yr)NHANES also provides several weights for subsamples. In NHANES, we identify the variable of interest that was collected on the smallest number of respondents. The sample weight that applies to that variable is the appropriate one to use. In our first case, we will study total cholesterol and use the weights from the MEC exam.
SEQN = subject identifying codeRIAGENDR = sex (1 = M, 2 = F)RIDAGEYR = age (in years at screening, topcode at 80)DMQMILIZ = served active duty in US Armed Forces (yes/no)RIDSTATR = 2 if subject took both interview and MEC examWTMEC2YR - Full sample 2 year MEC exam weightLBXTC = Total Cholesterol (mg/dl) - this is our outcomeThe first 5 are in DEMO_H, and the first and last are in TCHOL_H.
DEMO and TCHOL filesnh1314 <- joined_df |> # has n = 8291
tibble() |>
filter(complete.cases(LBXTC)) |> # now n = 7624
filter(RIDSTATR == 2) |> # still 7624
filter(RIDAGEYR > 19 & RIDAGEYR < 40) |> # now n = 1802
filter(DMQMILIZ < 3) |> # drop 7 = refused, n = 1801
mutate(FEMALE = RIAGENDR - 1,
AGE = RIDAGEYR,
US_MIL = ifelse(DMQMILIZ == 1, 1, 0),
WT_EX = WTMEC2YR,
TOTCHOL = LBXTC) |>
select(SEQN, FEMALE, AGE, TOTCHOL, US_MIL, WT_EX)nh1314 analytic sample AGE TOTCHOL WT_EX
Min. :20.00 Min. : 69 Min. : 8430
1st Qu.:24.00 1st Qu.:156 1st Qu.: 24694
Median :30.00 Median :178 Median : 34642
Mean :29.47 Mean :181 Mean : 44529
3rd Qu.:34.00 3rd Qu.:203 3rd Qu.: 59561
Max. :39.00 Max. :417 Max. :125680
US_MIL
FEMALE 0 1 Total
0 829 45 874
1 921 6 927
Total 1750 51 1801
df_stats with gt()df_stats(~ AGE + TOTCHOL, data = nh1314) |>
mutate(across(mean:sd, ~ round_half_up(.x, 2))) |>
gt() |> tab_options(table.font.size = 20) |>
tab_header(title = "Approach A",
subtitle = "Data from nh1314 sample, unadjusted")| Approach A | |||||||||
| Data from nh1314 sample, unadjusted | |||||||||
| response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| AGE | 20 | 24 | 30 | 34 | 39 | 29.47 | 5.80 | 1801 | 0 |
| TOTCHOL | 69 | 156 | 178 | 203 | 417 | 181.01 | 37.41 | 1801 | 0 |
df_stats with gt()df_stats(~ AGE + TOTCHOL, data = nh1314) |>
gt() |> fmt_number(columns = mean:sd, decimals = 2) |>
tab_options(table.font.size = 20) |>
tab_header(title = "Approach B",
subtitle = "Data from nh1314 sample, unadjusted")| Approach B | |||||||||
| Data from nh1314 sample, unadjusted | |||||||||
| response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| AGE | 20 | 24 | 30 | 34 | 39 | 29.47 | 5.80 | 1801 | 0 |
| TOTCHOL | 69 | 156 | 178 | 203 | 417 | 181.01 | 37.41 | 1801 | 0 |
nh1314 analytic sample: WeightsEach weight represents the number of people exemplified by that subject.
gtsummary() to describe nh1314 (unweighted)| Characteristic | N = 1,8011 |
|---|---|
| FEMALE | 927 (51%) |
| AGE | 30.0 (24.0, 34.0) |
| TOTCHOL | 178 (156, 203) |
| US_MIL | 51 (2.8%) |
| WT_EX | 34,642 (24,694, 59,561) |
| 1 n (%); Median (IQR) | |
See https://www.danieldsjoberg.com/gtsummary/ for more options.
nh_design survey designWhat is the mean of total cholesterol, overall and in groups?
mean SE
TOTCHOL 181.25 1.0172
FEMALE TOTCHOL se
0 0 182.6313 1.515072
1 1 179.8881 1.359801
FEMALE US_MIL TOTCHOL se
0.0 0 0 182.3569 1.575994
1.0 1 0 180.0248 1.368408
0.1 0 1 186.6966 5.354835
1.1 1 1 164.1984 6.535223
Mean of total cholesterol within groups with 90% CI?
grouped_result <- svyby(~ TOTCHOL, ~ FEMALE + US_MIL,
nh_design, svymean, na.rm = TRUE)
coef(grouped_result) 0.0 1.0 0.1 1.1
182.3569 180.0248 186.6966 164.1984
5 % 95 %
0.0 179.7646 184.9492
1.0 177.7739 182.2756
0.1 177.8887 195.5045
1.1 153.4489 174.9478
se(grouped_result), too.resres <- tibble(
type = rep(c("Unweighted", "Survey-Weighted"),4),
female = c(rep("Female", 4), rep("Male", 4)),
us_mil = rep(c("Military", "Military", "Not Military", "Not Military"), 2),
MEAN = c(169.5, 164.1984, 179.71, 180.0248, 187.11, 186.6966, 182.22, 182.3569) )
res |> gt()| type | female | us_mil | MEAN |
|---|---|---|---|
| Unweighted | Female | Military | 169.5000 |
| Survey-Weighted | Female | Military | 164.1984 |
| Unweighted | Female | Not Military | 179.7100 |
| Survey-Weighted | Female | Not Military | 180.0248 |
| Unweighted | Male | Military | 187.1100 |
| Survey-Weighted | Male | Military | 186.6966 |
| Unweighted | Male | Not Military | 182.2200 |
| Survey-Weighted | Male | Not Military | 182.3569 |
TOTCHOL in nh1314First, we’ll ignore weighting, and fit a model with main effects of all three predictors (mod1), then a model (mod2) which incorporates an interaction of FEMALE and US_MIL.
The interaction term means that the effect of FEMALE on TOTCHOL depends on the US_MIL status.
mod1, unweightedtidy(mod1, conf.int = TRUE, conf.level = 0.90) |>
select(-statistic) |> gt() |> tab_options(table.font.size = 20)| term | estimate | std.error | p.value | conf.low | conf.high |
|---|---|---|---|---|---|
| (Intercept) | 136.345657 | 4.4915861 | 9.534649e-164 | 128.953845 | 143.7374695 |
| AGE | 1.571367 | 0.1474222 | 9.149426e-26 | 1.328754 | 1.8139805 |
| FEMALE | -3.312433 | 1.7274350 | 5.532719e-02 | -6.155276 | -0.4695898 |
| US_MIL | 2.003854 | 5.2026231 | 7.001628e-01 | -6.558113 | 10.5658215 |
mod1 Residuals (plots 1, 2)mod1 Residuals (plots 3, 5)mod2, unweightedtidy(mod2, conf.int = TRUE, conf.level = 0.90) |>
select(-statistic) |> gt() |> tab_options(table.font.size = 20)| term | estimate | std.error | p.value | conf.low | conf.high |
|---|---|---|---|---|---|
| (Intercept) | 136.299221 | 4.4922913 | 1.340759e-163 | 128.906246 | 143.6921962 |
| AGE | 1.570077 | 0.1474422 | 1.015235e-25 | 1.327431 | 1.8127229 |
| FEMALE | -3.151959 | 1.7380814 | 6.992615e-02 | -6.012324 | -0.2915943 |
| US_MIL | 3.639800 | 5.5547809 | 5.123873e-01 | -5.501717 | 12.7813170 |
| FEMALE:US_MIL | -13.342653 | 15.8650968 | 4.004561e-01 | -39.451882 | 12.7665766 |
mod2 Residuals (plots 1, 2)mod2 Residuals (plots 3, 5)svyglmAgain, we’ll run two models, first without and second with an interaction term between FEMALE and US_MIL.
Gaussian family used to generate linear regressions here.
tidy(glm1_results, conf.int = TRUE, conf.level = 0.90) |>
select(-statistic) |> gt() |> tab_options(table.font.size = 20)| term | estimate | std.error | p.value | conf.low | conf.high |
|---|---|---|---|---|---|
| (Intercept) | 137.1292664 | 5.0039123 | 1.906826e-138 | 128.894318 | 145.36421489 |
| AGE | 1.5646634 | 0.1696597 | 7.889576e-20 | 1.285454 | 1.84387266 |
| FEMALE | -3.2123089 | 2.0091506 | 1.100321e-01 | -6.518772 | 0.09415432 |
| US_MIL | 0.5935502 | 5.0392343 | 9.062506e-01 | -7.699528 | 8.88662816 |
tidy(glm2_results, conf.int = TRUE, conf.level = 0.90) |>
select(-statistic) |> gt() |> tab_options(table.font.size = 20)| term | estimate | std.error | p.value | conf.low | conf.high |
|---|---|---|---|---|---|
| (Intercept) | 136.863878 | 5.0060799 | 6.872607e-138 | 128.625359 | 145.1023957 |
| AGE | 1.567633 | 0.1695865 | 6.517529e-20 | 1.288544 | 1.8467218 |
| FEMALE | -2.868135 | 2.0285450 | 1.575681e-01 | -6.206517 | 0.4702464 |
| US_MIL | 3.426681 | 5.4709976 | 5.311744e-01 | -5.576953 | 12.4303155 |
| FEMALE:US_MIL | -22.065349 | 8.5522325 | 9.956850e-03 | -36.139779 | -7.9909185 |
glm1_resultsglm1_resultsglm2_resglm2_resKey Source: https://wwwn.cdc.gov/nchs/data/tutorials/DB303_Fig1_R.R
Now, we are looking at the percentage of persons aged 20 and over with depression, by age and sex, in the US in 2013-2016. Pull in data using nhanesA…
DEMO_H <- nhanes('DEMO_H', translated = FALSE) |>
select(SEQN, RIAGENDR, RIDAGEYR, SDMVSTRA, SDMVPSU, WTMEC2YR)
DEMO_I <- nhanes('DEMO_I', translated = FALSE) |>
select(SEQN, RIAGENDR, RIDAGEYR, SDMVSTRA, SDMVPSU, WTMEC2YR)
DEMO <- bind_rows(DEMO_H, DEMO_I)
DPQ_H <- nhanes('DPQ_H', translated = FALSE)
DPQ_I <- nhanes('DPQ_I', translated = FALSE)
DPQ <- bind_rows(DPQ_H, DPQ_I) dat2 <- left_join(DEMO, DPQ, by = "SEQN") |> tibble() |>
# Set 7=Refused and 9=Don't Know To NA
mutate(across(.cols = DPQ010:DPQ090,
~ ifelse(. >=7, NA, .))) %>%
mutate(one = 1,
PHQ9_score = rowSums(select(. , DPQ010:DPQ090)),
Depression = ifelse(PHQ9_score >= 10, 100, 0),
Sex = factor(RIAGENDR, labels = c("M", "F")),
Age_group = cut(RIDAGEYR,
breaks = c(-Inf, 19, 39, 59, Inf),
labels = c("Under 20", "20-39", "40-59", "60+")),
WTMEC4YR = WTMEC2YR/2,
inAnalysis = (RIDAGEYR >= 20 & !is.na(PHQ9_score))) |>
select(-starts_with("DPQ"))Here’s the survey design for the overall data set:
NH_des_all <- svydesign(data = dat2, id = ~ SDMVPSU,
strata = ~ SDMVSTRA, weights = ~ WTMEC4YR, nest = TRUE)
dim(NH_des_all)[1] 20146 13
Here’s the survey design object for the subset of interest: adults aged 20 and over with a valid PHQ-9 depression score:
Sex counts Depression se
1 M 4821 5.549344 0.4293217
2 F 5121 10.427654 0.5658239
Age_group counts Depression se
1 20-39 3328 7.744613 0.5236944
2 40-59 3307 8.429826 0.6164284
3 60+ 3307 7.971216 0.7797954
Sex Age_group counts Depression se
1 M 20-39 1654 5.513778 0.6461045
2 F 20-39 1674 10.050321 0.8036891
3 M 40-59 1556 5.222060 0.7699895
4 F 40-59 1751 11.477238 1.2011361
5 M 60+ 1611 6.052782 0.8295114
6 F 60+ 1696 9.579923 1.0534115
Across all age groups:
In people ages 40-59:
Design-based t-test
data: Depression ~ Sex
t = 3.8688, df = 29, p-value = 0.0005706
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
2.948407 9.561949
sample estimates:
difference in mean
6.255178
Design-based t-test
data: Depression ~ Age_group
t = 0.79398, df = 29, p-value = 0.4337
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
-1.079836 2.450262
sample estimates:
difference in mean
0.6852129
432 Class 03 | 2024-01-23 | https://thomaselove.github.io/432-2024/